In most application fields and disciplines, the continuous increase in the size of spatial data sets, both in terms of volume and richness of explanatory variables, has raised problems for the use of Geographically Weighted Regression (GWR; Brunsdon et al., 1996; McMillen, 1996) in recent years. As the spatial extent of a sample increases, its spatial resolution and the richness of its explanatory variables also increase, and it becomes increasingly necessary to take spatial heterogeneity into account. This can impose computation times that are prohibitive for GWR methods. In 2017, the use of GWR methods on samples of more than 50,000 observations was not really feasible, and several authors have proposed improvements to meet the challenges of using GWR with big data. The two main problems are the time required to compute each set of local coefficients and the memory required to store the \(n \times n\) hat matrix used to estimate the variance of the parameters. To address these two issues, various avenues have been explored:
Indeed, one can use a limited set of evaluation points (or target points) together with interpolation methods to generalize the fit to the other points, thereby speeding up the computation. In the more general context of locally weighted regression, Loader (1999) proposed an interpolation method based on an adaptive decision-tree approach using the local density of points. In the non-parametric literature, this type of two-stage strategy, which consists in approximating a function on a limited set of points and then extrapolating it, is fairly common: the famous Super Smoother algorithm of Friedman (1984) is based on this idea. For kernel density estimators and kernel regression, there is a wide variety of proposals on how to choose a subset of points that yields the best extrapolations with the least possible computational load: the choice can be based on the density of observations, on the complexity of the curves, or on equidistributing the points over the support of the function.
Release 1.0 of the mgwrsar package proposes two improvements to speed up the computation of the GWR-like models considered in the previous release, with or without spatial autocorrelation, through the joint use of a rough Gaussian kernel and estimation on a subset of target points. It proposes:
If we consider a sample of 4,000 observations, the use of a rough Gaussian kernel accelerates GWR computation by a factor of 2.5 without significant bias in the coefficients. Moreover, the use of a suitable set of target points accelerates GWR computation by a factor of 50 and, surprisingly, reduces the bias of the final local coefficients by improving the choice of the optimal bandwidth (see Geniaux 2024).
This package proposes two ways of selecting target points: first, a quadcell algorithm based on the local density of locations; second, a method based on a spatial smoothing of the residuals of a first-stage OLS model. The latter generally outperforms the density-based method, particularly when the number of target points is smaller than 25% of the locations, and is therefore the default method in this release.
library(mgwrsar)
#> Loading required package: sp
#> Loading required package: Matrix
#> Registered S3 method overwritten by 'future':
#> method from
#> all.equal.connection parallelly
## loading data example
data(mydata)
coord=as.matrix(mydata[,c("x","y")])
head(coord)
#> x y
#> [1,] 872142.8 6234075
#> [2,] 872894.3 6233366
#> [3,] 878615.1 6229884
#> [4,] 869936.0 6228408
#> [5,] 866441.8 6231471
#> [6,] 866952.9 6222411
head(mydata)
#> Y_mgwrsar_1_0_kv X0 X1 X2 X3 Beta0 Beta1
#> 1 0.44146014 1 2.396274 0.4763857 0.6854775 0.11599551 0.26922206
#> 2 0.02223952 1 1.684776 0.3303000 0.8994048 0.11533789 0.25784575
#> 3 3.41607993 1 2.369278 1.5117796 0.5808293 0.20973774 0.55723064
#> 4 0.28497090 1 4.208743 1.0916590 1.4058201 -0.08706293 0.06795084
#> 5 -0.79417336 1 3.109543 0.6487661 1.6977387 -0.08209832 0.21763086
#> 6 6.78538775 1 7.284831 0.8828887 1.5085908 -0.28330886 0.52468859
#> Beta2 Beta3 lambda Y_ols Y_sar Y_gwr Y_mgwr
#> 1 0.6772792 -1.1106223 0.2408643 0.9890454 2.010387 0.32246490 0.39829402
#> 2 0.6751576 -1.0326101 0.2273072 0.2732833 1.539485 -0.15597959 -0.12664995
#> 3 0.5623073 -0.5412975 0.3384363 2.1155894 3.637831 2.06565538 1.79922754
#> 4 1.0796267 -0.9258913 0.3543732 1.7902102 3.201578 0.07587226 -0.02831127
#> 5 1.1019372 -1.2759568 0.3772657 0.5057989 1.460611 -0.85670748 -0.38820500
#> 6 1.5488474 -0.7649646 0.4933481 3.0167134 4.654380 3.75240014 3.39782791
#> Y_mgwrsar_0_0_kv Y_mgwrsar_0_kc_kv Y_mgwrsar_1_kc_kv Y_mgwr_outlier X2dummy
#> 1 0.5673431 0.7137761 0.55150700 0.3982940 FALSE
#> 2 0.2031676 0.2611103 0.06388119 -0.1266500 TRUE
#> 3 3.8183989 3.1882294 2.86908705 1.7992275 TRUE
#> 4 0.3314998 0.1832260 0.14519215 -0.0188881 TRUE
#> 5 -0.7858941 -0.1114446 -0.13694311 -0.3882050 FALSE
#> 6 5.8227160 5.2859121 6.16333174 3.3978279 TRUE
#> Y_mgwr_X2dummy x y
#> 1 0.07564794 872142.8 6234075
#> 2 -0.34965450 872894.3 6233366
#> 3 0.94914284 878615.1 6229884
#> 4 -1.20689547 869936.0 6228408
#> 5 -1.10310449 866441.8 6231471
#> 6 2.03036799 866952.9 6222411
## plot
library(ggplot2)
ggplot(mydata,aes(x,y))+geom_point(aes(color=Beta0))+ ggtitle('Spatial Pattern of coefficient Beta0')+scale_colour_gradientn(colours = terrain.colors(10))
ggplot(mydata,aes(x,y))+geom_point(aes(color=Beta1))+ ggtitle('Spatial Pattern of coefficient Beta1')+scale_colour_gradientn(colours = terrain.colors(10))
The GWR model can be expressed as: \[\begin{equation} y_{i}=\sum_{j=1}^{J}\beta_{j}(u_{i},v_{i})x_{ij}+\epsilon_{i} \;\;\; i=1,2,\ldots,n \label{modelGWR} \end{equation}\] where \(J\) is the number of exogenous regressors and \(x_{ij}\) is the regressor \(j\) for observation \(i\) (possibly an intercept).
The estimator for \(\beta_{j}(u_{i},v_{i})\) is : \[\begin{equation} \hat{\beta}(u_{i},v_{i})=(X'W_{i}X)^{-1} X'W_{i}Y \label{BetaGWR} \end{equation}\]
If we denote by \(W\) a spatial weight matrix based on a distance-decay kernel or on k-nearest neighbors, then \(W_{i}\) is the \(n\times n\) diagonal spatial weight matrix specific to location \(i\), whose diagonal is the \(i\)th row of \(W\), all other elements being zero. Usual kernels are the Gaussian and bisquare kernels, with a bandwidth defined as a distance or, for adaptive kernels, as a number of neighbors.
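As a compact illustration of this estimator, the local weighted least squares computation can be sketched in base R on simulated data (a minimal sketch with illustrative names, not the package's implementation):

```r
# Minimal base-R sketch of the local GWR estimator: weighted least squares
# at location i with Gaussian kernel weights of bandwidth h.
# (Illustrative only; not the mgwrsar implementation.)
gwr_local_beta <- function(X, y, coords, i, h) {
  d2 <- rowSums(sweep(coords, 2, coords[i, ])^2)  # squared distances to i
  w <- exp(-d2 / (2 * h^2))                       # Gaussian weights: diagonal of W_i
  XtW <- t(X * w)                                 # X' W_i
  solve(XtW %*% X, XtW %*% y)                     # (X' W_i X)^{-1} X' W_i y
}

set.seed(1)
n <- 200
coords <- cbind(runif(n), runif(n))
X <- cbind(1, runif(n))                           # intercept + one regressor
beta1 <- 1 + coords[, 1]                          # spatially varying slope
y <- 0.5 + X[, 2] * beta1 + rnorm(n, sd = 0.01)
b1 <- gwr_local_beta(X, y, coords, i = 1, h = 0.1)
```

The estimate at location 1 recovers a slope close to the local true value \(1 + u_1\), the weighted average of the varying coefficient in the kernel neighborhood.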
One way to speed up GWR estimation and to reduce the required amount of memory is to increase the sparsity of the weight matrix \(W\). When one uses a k-nearest-neighbors weighting scheme or an adaptive kernel with null weights beyond the bandwidth (like the bisquare, trisquare, Epanechnikov, triangle or rectangle kernels), \(W\) is very sparse when the bandwidth is small. When Gaussian-like kernels are used, one way to increase the sparsity of \(W\) is to use truncated Gaussian kernels that set weights to zero when they are very small or when they concern observations beyond a given number of neighbors.
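The truncation idea can be sketched as follows (function names are ours, not the package's internals): Gaussian weights are kept only for the NN nearest neighbors of each location, so the weight matrix can be stored in sparse form.

```r
library(Matrix)

# Truncated ("rough") Gaussian weights: nonzero only for the NN nearest
# neighbors of each row, so W is sparse. Illustrative sketch, fine for small n.
rough_gauss_W <- function(coords, h, NN) {
  n <- nrow(coords)
  D <- as.matrix(dist(coords))      # full distance matrix (small n only)
  W <- Matrix(0, n, n, sparse = TRUE)
  for (i in 1:n) {
    nb <- order(D[i, ])[1:NN]       # the NN closest points (including i)
    W[i, nb] <- exp(-D[i, nb]^2 / (2 * h^2))
  }
  W
}

set.seed(1)
coords <- cbind(runif(100), runif(100))
W <- rough_gauss_W(coords, h = 0.05, NN = 10)
sum(W != 0) / length(W)             # only NN/n = 10% of the weights are stored
```

With NN = 10 and n = 100, each row carries exactly 10 nonzero weights, so both storage and the cost of each local fit scale with NN rather than n.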
In the first example, we show how to compute a GWR model with and without a rough Gaussian kernel: we use an optimal bandwidth \(H\) of 0.03, corresponding on average to the distance to the 25th neighbor, and for the rough Gaussian kernel only the first 300 neighbors (NN=300) are taken into account in the distance and weight calculations. In this case the computation time is divided by about 10. The larger the sample, the greater the relative improvement. We notice in this example, where the optimal bandwidth is far from the bound used to truncate the kernel, that the rough Gaussian kernel has no detrimental effect. When the optimal bandwidth is closer to the neighborhood used to truncate the Gaussian kernel, the approximation differences will be larger and it may be advisable to increase the \(NN\) parameter that defines the truncation neighborhood of the Gaussian kernel.
## without rough gaussian kernel
ptm1<-proc.time()
model_GWR<-MGWRSAR(formula = 'Y_gwr~X1+X2+X3', data = mydata,coords=coord, fixed_vars=NULL,kernels=c('gauss'),H=0.03, Model = 'GWR',control=list(SE=TRUE))
(proc.time()-ptm1)[3]
#> elapsed
#> 0.292
## with rough gaussian kernel
ptm1<-proc.time()
model_GWR_grk<-MGWRSAR(formula = 'Y_gwr~X1+X2+X3', data = mydata,coords=coord, fixed_vars=NULL,kernels=c('gauss'),H=0.03, Model = 'GWR',control=list(SE=TRUE,NN=300))
(proc.time()-ptm1)[3]
#> elapsed
#> 0.029
summary(model_GWR@Betav)
#> Intercept X1 X2 X3
#> Min. :-1.485 Min. :0.5 Min. :2.548 Min. :-1.076
#> 1st Qu.:-1.485 1st Qu.:0.5 1st Qu.:2.548 1st Qu.:-1.076
#> Median :-1.485 Median :0.5 Median :2.548 Median :-1.076
#> Mean :-1.485 Mean :0.5 Mean :2.548 Mean :-1.076
#> 3rd Qu.:-1.485 3rd Qu.:0.5 3rd Qu.:2.548 3rd Qu.:-1.076
#> Max. :-1.485 Max. :0.5 Max. :2.548 Max. :-1.076
summary(model_GWR_grk@Betav)
#> Intercept X1 X2 X3
#> Min. :-1.485 Min. :0.5 Min. :2.548 Min. :-1.076
#> 1st Qu.:-1.485 1st Qu.:0.5 1st Qu.:2.548 1st Qu.:-1.076
#> Median :-1.485 Median :0.5 Median :2.548 Median :-1.076
#> Mean :-1.485 Mean :0.5 Mean :2.548 Mean :-1.076
#> 3rd Qu.:-1.485 3rd Qu.:0.5 3rd Qu.:2.548 3rd Qu.:-1.076
#> Max. :-1.485 Max. :0.5 Max. :2.548 Max. :-1.076
In the second example, we show how to compute a GWR model on a sample of target points using mgwrsar 1.0. Choosing target points with respect to the spatial distribution of the residuals of a first-stage OLS regression is preferable to selecting them with respect to the density of locations; this option is selected by passing type='residuals' to the find_TP function.
The selection of target points with type='residuals' involves 3 steps: a first-stage OLS regression is estimated; the absolute values of its residuals are spatially smoothed using the \(ks\) nearest neighbors; and the locations where this smoothed residual is a local maximum among their \(kt\) nearest neighbors are kept as target points. In order to get local maxima at a fine scale, we use between twelve and sixteen neighbors (the default) for \(ks\). In the last step, the value of \(kt\) allows the number of local maxima chosen to be increased or decreased. The choice of \(ks\) is less critical than that of \(kt\). The resulting number of target points depends both on \(kt\) and on the spatial distribution of the residuals: as \(kt\) increases, the number of target points decreases, and the size of the resulting target point set is less than \(n/kt\).
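The steps above can be sketched in base R as follows (our reading of the procedure, with illustrative names; find_TP's actual implementation may differ):

```r
# Sketch of residual-based target point selection:
# 1) first-stage OLS residuals, 2) ks-nearest-neighbor smoothing of their
# absolute values, 3) keep locations that are local maxima of the smoothed
# residual among their kt nearest neighbors.
select_TP <- function(formula, data, coords, ks = 16, kt = 8) {
  r <- abs(residuals(lm(formula, data = data)))
  D <- as.matrix(dist(coords))
  n <- nrow(coords)
  smoothed <- sapply(1:n, function(i) mean(r[order(D[i, ])[1:ks]]))
  keep <- sapply(1:n, function(i) {
    nb <- order(D[i, ])[2:(kt + 1)]   # kt nearest neighbors, excluding i
    all(smoothed[i] >= smoothed[nb])
  })
  which(keep)
}

set.seed(1)
n <- 400
coords <- cbind(runif(n), runif(n))
dat <- data.frame(x1 = runif(n))
dat$y <- (1 + coords[, 1]) * dat$x1 + rnorm(n, sd = 0.1)  # spatially varying slope
TP <- select_TP(y ~ x1, dat, coords, ks = 12, kt = 6)
length(TP)                            # far fewer target points than n
```

Because each selected location must dominate its \(kt\) neighbors, the number of target points shrinks quickly as \(kt\) grows, in line with the \(n/kt\) bound mentioned above.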
To illustrate the methodology, we simulate a single run of the GWR DGP proposed by Geniaux and Martinetti (2018) and apply the selection of target points with respect to the first-stage OLS residuals. Figure 1 maps the absolute value of the smoothed (with \(ks = 16\)) residuals of the linear model, together with the target points chosen when \(kt\) is set to 48 (green points) and when \(kt\) is set to 8 (blue points). We can see on the left-hand map that the absolute value of the smoothed residual of the green target points is higher than that of their 48 closest neighbors.
In the following example, the function find_TP is used to identify a target point set with \(kt=6\), which leads to choosing 92 target points that are then passed to the MGWRSAR function: MGWRSAR first computes the local \(\beta(u_i,v_i)\) coefficients for the 92 target points and then extrapolates the local \(\beta(u_i,v_i)\) coefficients for the 908 other locations.
As expected, the approximations of the \(\beta(u_i,v_i)\) coefficients obtained with this target point set differ from those of the full GWR, with small differences in the means of the coefficients but smaller ranges and variances for the target point results. On this small sample the time saving is modest but, as mentioned before, the larger the sample, the greater the relative improvement: the computation time can be divided by 50 for big samples, particularly if this method is used in conjunction with a rough Gaussian kernel (see Geniaux 2024).
## identify the target points
TP=find_TP(formula = 'Y_gwr~X1+X2+X3', data =mydata,coords=coord,kt=6,type='residuals',ks=12)
# only 92 target points are used
length(TP)
#> [1] 92
## comparison of computation times
ptm1<-proc.time()
# benchmark GWR without target points
model_GWR<-MGWRSAR(formula = 'Y_gwr~X1+X2+X3', data = mydata,coords=coord, fixed_vars=NULL,kernels=c('gauss'),H=10000, Model = 'GWR',control=list(SE=TRUE))
(proc.time()-ptm1)[3]
#> elapsed
#> 0.165
ptm1<-proc.time()
# GWR with target points
model_GWR_tp<-MGWRSAR(formula = 'Y_gwr~X1+X2+X3', data = mydata,coords=coord, fixed_vars=NULL,kernels=c('gauss'),H=10000, Model = 'GWR',control=list(SE=TRUE,TP=TP))
(proc.time()-ptm1)[3]
#> elapsed
#> 0.283
# GWR with target points and rough gaussian kernel
ptm1<-proc.time()
model_GWR_tp_NN<-MGWRSAR(formula = 'Y_gwr~X1+X2+X3', data = mydata,coords=coord, fixed_vars=NULL,kernels=c('gauss'),H=10000, Model = 'GWR',control=list(SE=TRUE,TP=TP,NN=300))
(proc.time()-ptm1)[3]
#> elapsed
#> 0.064
## comparison of betas estimates
#GWR
summary(model_GWR@Betav)
#> Intercept X1 X2 X3
#> Min. :-1.698 Min. :0.4560 Min. :2.092 Min. :-1.3297
#> 1st Qu.:-1.497 1st Qu.:0.4628 1st Qu.:2.290 1st Qu.:-1.1445
#> Median :-1.345 Median :0.4704 Median :2.479 Median :-1.0669
#> Mean :-1.375 Mean :0.4721 Mean :2.483 Mean :-1.0662
#> 3rd Qu.:-1.242 3rd Qu.:0.4801 3rd Qu.:2.675 3rd Qu.:-0.9841
#> Max. :-1.154 Max. :0.5086 Max. :2.947 Max. :-0.8345
# GWR with target points
summary(model_GWR_tp@Betav)
#> Intercept X1 X2 X3
#> Min. :-1.674 Min. :0.4564 Min. :2.103 Min. :-1.3136
#> 1st Qu.:-1.444 1st Qu.:0.4721 1st Qu.:2.441 1st Qu.:-1.0890
#> Median :-1.394 Median :0.4738 Median :2.510 Median :-1.0606
#> Mean :-1.396 Mean :0.4741 Mean :2.508 Mean :-1.0595
#> 3rd Qu.:-1.349 3rd Qu.:0.4760 3rd Qu.:2.575 3rd Qu.:-1.0287
#> Max. :-1.155 Max. :0.5038 Max. :2.938 Max. :-0.8473
# GWR with target points and rough gaussian kernel
summary(model_GWR_tp_NN@Betav)
#> Intercept X1 X2 X3
#> Min. :-2.0514 Min. :0.2226 Min. :-0.2742 Min. :-1.5937
#> 1st Qu.:-1.5547 1st Qu.:0.4399 1st Qu.: 1.8726 1st Qu.:-1.2597
#> Median :-1.2084 Median :0.4769 Median : 2.2554 Median :-1.0454
#> Mean :-1.2135 Mean :0.4654 Mean : 2.2534 Mean :-1.0471
#> 3rd Qu.:-0.9532 3rd Qu.:0.4985 3rd Qu.: 2.6743 3rd Qu.:-0.8123
#> Max. : 0.8603 Max. :0.5678 Max. : 3.3180 Max. :-0.5971
The two approximation methods described above can be used to reduce the time needed to find the optimal bandwidth, and are available in the bandwidths_mgwrsar function. The next example compares the time for finding an optimal bandwidth for the full GWR and for a GWR with a target point sample and a rough Gaussian kernel.
The time for finding the optimal bandwidth is divided by around 8 and, here again, larger improvements can be achieved with larger sample sizes. In this example, the optimal bandwidth is smaller for the methods with target points: this result is specific to this simulation and the opposite can occur with different data. Although the GWR with target points reaches a lower LOOCV in this example, it cannot be compared with the LOOCV of the full GWR: computing a Leave-One-Out Cross-Validation (LOOCV) criterion that is comparable between the full GWR and the GWR with target points is extremely complex and time-consuming, so mgwrsar 1.0 uses only the LOOCV of the target points for choosing the optimal bandwidth. If the user wants to compare the out-of-sample accuracy of the two methods, k-fold cross-validation can be performed.
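For intuition on the bandwidth search itself, a golden-section search minimizing a LOOCV score can be sketched as below; here the criterion is the LOOCV of a simple Nadaraya-Watson smoother, a stand-in for the GWR criteria used by the package (all names are illustrative):

```r
# Golden-section search for the bandwidth minimizing a LOOCV criterion.
golden_search <- function(f, lower, upper, tol = 1e-3) {
  phi <- (sqrt(5) - 1) / 2
  a <- lower; b <- upper
  while (b - a > tol) {
    c1 <- b - phi * (b - a)
    c2 <- a + phi * (b - a)
    if (f(c1) < f(c2)) b <- c2 else a <- c1
  }
  (a + b) / 2
}

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

# LOOCV score of a Nadaraya-Watson smoother with a Gaussian kernel:
# zeroing diag(K) removes each point from its own fit.
loocv <- function(h) {
  K <- exp(-outer(x, x, "-")^2 / (2 * h^2))
  diag(K) <- 0
  yhat <- as.vector(K %*% y) / rowSums(K)
  mean((y - yhat)^2)
}

h_opt <- golden_search(loocv, 0.01, 1)
```

Each iteration shrinks the search interval by a constant factor, so the number of criterion evaluations grows only logarithmically with the required bandwidth precision; this is why restricting each evaluation to the target points pays off so much.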
## identify the optimal bandwidth for a GWR model with a non-adaptive kernel (the bandwidth is a distance)
res1=golden_search_bandwidth(formula='Y_gwr~X1+X2+X3',Ht=NULL,data = mydata,coords=coord, fixed_vars=NULL,kernels=c('gauss'), Model = 'GWR',control=list(verbose=F,NN=300,criterion='AICc',adaptive=F),lower.bound=500, upper.bound=100000,tolerance=0.001,show_progress=FALSE)
res1$minimum
#> [1] 704.6604
res1$objective
#> [1] -2068.684
## identify the optimal bandwidth for a GWR model with an adaptive Gaussian kernel (the bandwidth is a number of neighbors)
res2=golden_search_bandwidth(formula='Y_gwr~X1+X2+X3',Ht=NULL,data = mydata,coords=coord, fixed_vars=NULL,kernels=c('gauss'), Model = 'GWR',control=list(verbose=F,NN=300,criterion='AICc',adaptive=T),lower.bound=2, upper.bound=300,tolerance=0.001,show_progress=FALSE)
res2$minimum
#> [1] 5
res2$objective
#> [1] -1767.636
model_opt<-res2$model
summary(model_opt)
#> ------------------------------------------------------
#> Call:
#> MGWRSAR(formula = formula, data = data, coords = coords, fixed_vars = fixed_vars,
#> kernels = kernels, H = c(res, Ht), Model = Model, control = control_o)
#>
#> Model: GWR
#> ------------------------------------------------------
#> Kernels Configuration:
#> Spatial Kernel : gauss (Adaptive / Neighbors)
#> ------------------------------------------------------
#> Bandwidth Configuration:
#> Spatial Bandwidth (H): 5
#> ------------------------------------------------------
#> Model Settings:
#> Computation time: 0.028 sec
#> Use of Target Points: NO
#> Use of rough kernel approximation: YES (300 neighbors)
#> Parallel computing: FALSE (ncore = 1)
#> Number of data points: 1000
#> ------------------------------------------------------
#> Coefficients Summary:
#> [Varying Parameters]
#> Intercept X1 X2 X3
#> Min. :-3.5205 Min. :0.02844 Min. :-0.6381 Min. :-1.9684
#> 1st Qu.:-1.4767 1st Qu.:0.35513 1st Qu.: 1.1184 1st Qu.:-1.2924
#> Median :-0.8353 Median :0.54312 Median : 1.6613 Median :-1.0021
#> Mean :-0.8216 Mean :0.50003 Mean : 1.6683 Mean :-1.0021
#> 3rd Qu.:-0.2212 3rd Qu.:0.63566 3rd Qu.: 2.4780 3rd Qu.:-0.7176
#> Max. : 1.7964 Max. :0.92955 Max. : 3.6503 Max. :-0.1288
#> ------------------------------------------------------
#> Diagnostics:
#> Effective degrees of freedom: 653.73
#> AICc: -1767.64
#> Residual sum of squares: 3.45
#> RMSE: 0.0588
#> ------------------------------------------------------
## target points with AICc criteria
#identify the target points
TP=find_TP(formula = 'Y_gwr~X1+X2+X3', data =mydata,coords=coord,kt=6,type='residuals')
length(TP)
#> [1] 91
# identify the optimal bandwidth using GWR estimations with target points and AICc criterion
res3=golden_search_bandwidth(formula='Y_gwr~X1+X2+X3',Ht=NULL,data = mydata,coords=coord, fixed_vars=NULL,kernels=c('gauss'), Model = 'GWR',control=list(verbose=F,NN=300,criterion='AICc',adaptive=T,TP=TP),lower.bound=2, upper.bound=300,tolerance=0.001,show_progress=FALSE)
res3$minimum
#> [1] 15
res3$objective
#> [1] -234.2819
model_opt<-res3$model
summary(model_opt)
#> ------------------------------------------------------
#> Call:
#> MGWRSAR(formula = formula, data = data, coords = coords, fixed_vars = fixed_vars,
#> kernels = kernels, H = c(res, Ht), Model = Model, control = control_o)
#>
#> Model: GWR
#> ------------------------------------------------------
#> Kernels Configuration:
#> Spatial Kernel : gauss (Adaptive / Neighbors)
#> ------------------------------------------------------
#> Bandwidth Configuration:
#> Spatial Bandwidth (H): 15
#> ------------------------------------------------------
#> Model Settings:
#> Computation time: 0.217 sec
#> Use of Target Points: YES
#> Number of Target Points: 91
#> Use of rough kernel approximation: YES (300 neighbors)
#> Parallel computing: FALSE (ncore = 1)
#> Number of data points: 1000
#> ------------------------------------------------------
#> Coefficients Summary:
#> [Varying Parameters]
#> Intercept X1 X2 X3
#> Min. :-2.9461 Min. :0.09527 Min. :-0.7167 Min. :-1.8050
#> 1st Qu.:-1.5492 1st Qu.:0.39005 1st Qu.: 1.3385 1st Qu.:-1.3030
#> Median :-0.9803 Median :0.55775 Median : 1.6972 Median :-1.0123
#> Mean :-0.8659 Mean :0.50634 Mean : 1.7297 Mean :-1.0131
#> 3rd Qu.:-0.4215 3rd Qu.:0.63268 3rd Qu.: 2.5500 3rd Qu.:-0.7576
#> Max. : 1.4875 Max. :0.86435 Max. : 3.3842 Max. :-0.2177
#> ------------------------------------------------------
#> Diagnostics:
#> Effective degrees of freedom: 820.91
#> AICc: -234.28
#> Residual sum of squares: 29.89
#> RMSE: 0.1729
#> ------------------------------------------------------
## target points with CV criteria
# identify the optimal bandwidth using GWR estimations with target points and CV criterion
# for a Gaussian kernel the support is infinite (NN allows truncation)
res4=golden_search_bandwidth(formula='Y_gwr~X1+X2+X3',Ht=NULL,data = mydata,coords=coord, fixed_vars=NULL,kernels=c('gauss'), Model = 'GWR',control=list(verbose=F,NN=300,criterion='CV',adaptive=T,TP=TP),lower.bound=2, upper.bound=300,tolerance=0.001,show_progress=FALSE)
res4$minimum
#> [1] 8
## comparison with other adaptive kernels, with criterion AICc and target points
# epanechnikov kernel ('epane')
res5=golden_search_bandwidth(formula='Y_gwr~X1+X2+X3',Ht=NULL,data = mydata,coords=coord, fixed_vars=NULL,kernels=c('epane'), Model = 'GWR',control=list(verbose=F,NN=300,criterion='AICc',adaptive=T,TP=TP),lower.bound=8, upper.bound=300,tolerance=0.001,show_progress=FALSE)
res5$minimum
#> [1] 48
res5$objective
#> [1] -359.1012
# triangle kernel ('triangle') (NN is the max number of neighbors used to compute the kernel)
res6=golden_search_bandwidth(formula='Y_gwr~X1+X2+X3',Ht=NULL,data = mydata,coords=coord, fixed_vars=NULL,kernels=c('triangle'), Model = 'GWR',control=list(verbose=F,NN=300,criterion='AICc',adaptive=T),lower.bound=8, upper.bound=300,tolerance=0.001,show_progress=FALSE)
res6$minimum
#> [1] 21
res6$objective
#> [1] -1866.436
# bisquare kernel ('bisq')
res7=golden_search_bandwidth(formula='Y_gwr~X1+X2+X3',Ht=NULL,data = mydata,coords=coord, fixed_vars=NULL,kernels=c('bisq'), Model = 'GWR',control=list(verbose=F,NN=300,criterion='AICc',adaptive=T,TP=TP),lower.bound=8, upper.bound=300,tolerance=0.001,show_progress=FALSE)
res7$minimum
#> [1] 74
res7$objective
#> [1] -329.2572
In this example, we use an in-sample of size 800 for model estimation and an out-sample of size 200 for prediction. Note that for models with spatial autocorrelation you must provide a global weight matrix of size 1000, ordered by in-sample then out-sample locations.
For GWR and MGWRSAR_1_0_kv, where all coefficients vary in space, predictions can be carried out by re-estimating the corresponding model with the out-sample locations as target points, so neither the estimated coefficients nor the target points chosen for estimating the model are used: only the call and the data of the estimated model are used. For mixed models that have some constant coefficients (MGWR, MGWRSAR_0_0_kv, MGWRSAR_1_kc_kv, MGWRSAR_1_kc_0), the estimated coefficients (only those of the target points) are extrapolated using a weighting matrix. By default, the weighting matrix for prediction is based on the expected weights of the out-sample data if they had been added to the in-sample data to estimate the corresponding MGWRSAR model (Geniaux 2024). The user can also choose to extrapolate the model coefficients using a Shepard kernel with a custom number of neighbors, or using the same kernel and bandwidth as the estimated model.
In the following examples, whatever the model, the out-sample coefficients are extrapolated using a weighting matrix (method_pred='tWtp_model').
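The Shepard-kernel option mentioned above amounts to an inverse-distance-weighted average of the target point coefficients; a minimal sketch (illustrative names, not the package's internals):

```r
# Extrapolate local coefficients from target points to new locations by
# inverse-distance (Shepard) weighting over the k nearest target points.
shepard_extrapolate <- function(Beta_TP, coords_TP, coords_new, k = 8, p = 2) {
  t(apply(coords_new, 1, function(pt) {
    d <- sqrt(rowSums(sweep(coords_TP, 2, pt)^2))
    nb <- order(d)[1:k]
    w <- 1 / (d[nb]^p + 1e-12)        # inverse-distance weights
    colSums(Beta_TP[nb, , drop = FALSE] * (w / sum(w)))
  }))
}

set.seed(1)
coords_TP <- cbind(runif(50), runif(50))
Beta_TP <- cbind(1 + coords_TP[, 1], -coords_TP[, 2])  # smooth coefficient surfaces
coords_new <- cbind(runif(10), runif(10))
Beta_new <- shepard_extrapolate(Beta_TP, coords_TP, coords_new)
dim(Beta_new)                          # one coefficient vector per new location
```

Since the weights are a convex combination, each extrapolated coefficient stays within the range of its k nearest target point values, which is why smooth coefficient surfaces extrapolate well.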
## build insample and outsample folds
length_in=800
index_in=sample(1:1000,length_in)
index_out=(1:1000)[-index_in]
#insample data
data_in=mydata[index_in,]
coord_in=coord[index_in,]
#outsample data
newdata=mydata[index_out,]
newdata_coords=coord[index_out,]
newdata$Y_gwr=0
### benchmark: GWR model
# GWR model estimate on insample data, bandwidth optimization with CV
res_CV=golden_search_bandwidth(formula='Y_gwr~X1+X2+X3',Ht=NULL,data = data_in,coords=coord_in, fixed_vars=NULL,kernels=c('bisq'), Model = 'GWR',control=list(verbose=F,NN=300,criterion='CV',adaptive=T),lower.bound=2, upper.bound=300,tolerance=0.001,show_progress=FALSE)
res_CV$minimum
# GWR model estimate on insample data, bandwidth optimization with AICc
res_AICc=golden_search_bandwidth(formula='Y_gwr~X1+X2+X3',Ht=NULL,data = data_in,coords=coord_in, fixed_vars=NULL,kernels=c('gauss'), Model = 'GWR',control=list(verbose=F,NN=300,criterion='AICc',adaptive=T),lower.bound=2, upper.bound=300,tolerance=0.001,show_progress=FALSE)
res_AICc$minimum
### GWR model with target points
# identification of target points
TP=find_TP(formula = 'Y_gwr~X1+X2+X3', data =data_in,coords=coord_in,ks=16,kt=4,type='residuals')
# only 158 target points are used
length(TP)
## bandwidth optimization with target points
#cross-validation
resTP_CV=golden_search_bandwidth(formula='Y_gwr~X1+X2+X3',Ht=NULL,data = data_in,coords=coord_in, fixed_vars=NULL,kernels=c('gauss'), Model = 'GWR',control=list(verbose=F,NN=300,criterion='CV',adaptive=T,TP=TP),lower.bound=2, upper.bound=300,tolerance=0.001,show_progress=FALSE)
resTP_CV$minimum
# AICc
resTP_AICc=golden_search_bandwidth(formula='Y_gwr~X1+X2+X3',Ht=NULL,data = data_in,coords=coord_in, fixed_vars=NULL,kernels=c('gauss'), Model = 'GWR',control=list(verbose=F,NN=300,criterion='AICc',adaptive=T,TP=TP),lower.bound=2, upper.bound=300,tolerance=0.001,show_progress=FALSE)
resTP_AICc$minimum
### Predictions
### without TP
# with CV
Y_pred=predict(res_CV$model, newdata=newdata, newdata_coords=newdata_coords,method_pred='tWtp_model')
head(Y_pred)
head(mydata$Y_gwr[index_out])
sqrt(mean((mydata$Y_gwr[index_out]-Y_pred)^2))
# with AICc
Y_pred=predict(res_AICc$model, newdata=newdata, newdata_coords=newdata_coords,method_pred='tWtp_model')
head(Y_pred)
head(mydata$Y_gwr[index_out])
sqrt(mean((mydata$Y_gwr[index_out]-Y_pred)^2))
### with TP
# with CV
Y_pred=predict(resTP_CV$model, newdata=newdata, newdata_coords=newdata_coords,method_pred='tWtp_model')
head(Y_pred)
head(mydata$Y_gwr[index_out])
sqrt(mean((mydata$Y_gwr[index_out]-Y_pred)^2))
# with AICc
Y_pred=predict(resTP_AICc$model, newdata=newdata, newdata_coords=newdata_coords,method_pred='tWtp_model')
head(Y_pred)
head(mydata$Y_gwr[index_out])
sqrt(mean((mydata$Y_gwr[index_out]-Y_pred)^2))
Prediction of the MGWRSAR_1_0_kv model, with and without target points, using the Best Linear Unbiased Predictor (BLUP):
### Spatial weigths matrices
## Global Spatial Weight matrix W
# W should be ordered by in-sample (S) then out-sample locations
W=kernel_matW(H=4,kernels='rectangle',coords=rbind(coord[index_in,],coord[index_out,]),NN=4,adaptive=TRUE,diagnull=TRUE)
# W insample
W_in=kernel_matW(H=4,kernels='rectangle',coords=coord[index_in,],NN=4,adaptive=TRUE,diagnull=TRUE)
### identification of target points
TP=find_TP(formula = 'Y_mgwrsar_1_0_kv~X1+X2+X3', data =data_in,coords=coord_in,kt=4,type='residuals')
length(TP)
### estimate models
## model MGWRSAR_1_0_kv, adaptive gaussian kernel, target points, criterion='AICc'
res_MGWRSAR_1_0_kv_TP=golden_search_bandwidth(formula='Y_mgwrsar_1_0_kv~X1+X2+X3',Ht=NULL,data = data_in,coords=coord_in, fixed_vars=NULL,kernels=c('gauss'), Model ='MGWRSAR_1_0_kv',control=list(W=W_in,verbose=F,NN=300,criterion='AICc',adaptive=T,TP=TP),lower.bound=2, upper.bound=300,tolerance=0.001,show_progress=FALSE)
#optimal bandwidth
res_MGWRSAR_1_0_kv_TP$minimum
summary(res_MGWRSAR_1_0_kv_TP$model)
## model MGWRSAR_1_0_kv, adaptive gaussian kernel, criterion='CV'
res_MGWRSAR_1_0_kv=golden_search_bandwidth(formula='Y_mgwrsar_1_0_kv~X1+X2+X3',Ht=NULL,data = data_in,coords=coord_in, fixed_vars=NULL,kernels=c('gauss'), Model ='MGWRSAR_1_0_kv',control=list(W=W_in,verbose=F,NN=300,criterion='AICc',adaptive=T),lower.bound=2, upper.bound=300,tolerance=0.001,show_progress=FALSE)
#optimal bandwidth
res_MGWRSAR_1_0_kv$minimum
summary(res_MGWRSAR_1_0_kv$model)
### Predictions
# build outsample fold
newdata=mydata[index_out,]
newdata_coord=coord[index_out,]
newdata$Y_mgwrsar_1_0_kv=0
## without Best Linear Unbiased Predictor (type='YTC')
## without TP
Y_pred=predict(res_MGWRSAR_1_0_kv$model, newdata=newdata, newdata_coords=newdata_coord,W=W,type='YTC')
head(Y_pred)
RMSE_YTC=sqrt(mean((mydata$Y_mgwrsar_1_0_kv[index_out]-Y_pred)^2))
RMSE_YTC
## with TP
Y_pred=predict(res_MGWRSAR_1_0_kv_TP$model, newdata=newdata, newdata_coords=newdata_coord,W=W,type='YTC')
head(Y_pred)
RMSE_YTC=sqrt(mean((mydata$Y_mgwrsar_1_0_kv[index_out]-Y_pred)^2))
RMSE_YTC
## Using Best Linear Unbiased Predictor (type='BPN')
## without TP
Y_pred=predict(res_MGWRSAR_1_0_kv$model, newdata=newdata, newdata_coords=newdata_coord,W=W,type='BPN')
head(Y_pred)
RMSE_BPN=sqrt(mean((mydata$Y_mgwrsar_1_0_kv[index_out]-Y_pred)^2))
RMSE_BPN
## with TP
Y_pred=predict(res_MGWRSAR_1_0_kv_TP$model, newdata=newdata, newdata_coords=newdata_coord,W=W,type='BPN')
head(Y_pred)
RMSE_BPN=sqrt(mean((mydata$Y_mgwrsar_1_0_kv[index_out]-Y_pred)^2))
RMSE_BPN
Brunsdon, C., Fotheringham, A. S. and Charlton, M. 1996. Geographically weighted regression: a method for exploring spatial nonstationarity. Geographical Analysis, 28(4), 281–298.
Friedman, J. H. 1984. A variable span smoother. Technical Report 5, Laboratory for Computational Statistics, Department of Statistics, Stanford University.
Geniaux, G. and Martinetti, D. 2018. A new method for dealing simultaneously with spatial autocorrelation and spatial heterogeneity in regression models. Regional Science and Urban Economics, 72, 74–85. https://doi.org/10.1016/j.regsciurbeco.2017.04.001
Geniaux, G. and Martinetti, D. 2017. R package mgwrsar: Functions for computing (mixed) geographically weighted regression with spatial autocorrelation. https://CRAN.R-project.org/package=mgwrsar
Geniaux, G. 2024. Speeding up estimation of spatially varying coefficients models. Journal of Geographical Systems, 26, 293–327. https://doi.org/10.1007/s10109-024-00442-3
Loader, C. 1999. Local Regression and Likelihood. Springer, New York.
McMillen, D. P. 1996. One hundred fifty years of land values in Chicago: A nonparametric approach. Journal of Urban Economics, 40(1), 100–124.